Each observation belongs to a category:
Observations take numerical values that represent different magnitudes of the variable:
The two main differences between a bar chart and a histogram are:
When we convert from continuous to intervals, what is the type of the new variable?
Because the sample standard deviation is computed from a sample of the population, as an approximation of the real standard deviation.
The data points we get are more likely to be around the mean and less likely to be on the tails of the distribution, and the deviations are measured from the sample mean rather than the true mean.
So the sample standard deviation tends to underestimate the real value.
For this reason we decrease the denominator (dividing by n − 1 instead of n, Bessel's correction), which pushes the estimate back up to compensate.
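A quick simulation sketch of this idea (the population, sample size, and trial count are all made up for illustration): averaging the divide-by-n estimate over many small samples lands below the true variance, while the n − 1 version lands close to it.

```python
import random

random.seed(0)

# Population: standard normal, so the true variance is 1.
# Draw many small samples and average the two variance estimates.
n, trials = 5, 20000
biased_sum = corrected_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_sum += ss / n           # divide by n: systematically too small
    corrected_sum += ss / (n - 1)  # divide by n - 1: Bessel's correction

biased = biased_sum / trials       # ends up near (n-1)/n = 0.8
corrected = corrected_sum / trials # ends up near the true value, 1
print(biased, corrected)
```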
The Quartiles split the distribution into four parts that have the same number of observations:
You can find the quartiles by:
Example:
[30, 35, 53], [55, 57, 57], [60, 61, 64], [71, 78, 90]
The mean, the quartiles, and the standard deviation all reduce by 5%:
The same holds, and for the same reason (they are differences of values that were each reduced by 5%):
Range and IQR.
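This scaling property can be checked directly on the example data above. Note that `statistics.quantiles` uses a slightly different interpolation convention than splitting the sorted data into groups by hand, so the quartile values may differ a little from the worked example, but the 5% scaling holds either way.

```python
import statistics

data = [30, 35, 53, 55, 57, 57, 60, 61, 64, 71, 78, 90]
scaled = [0.95 * x for x in data]  # every value reduced by 5%

# Mean, standard deviation, range, and quartiles all scale by 0.95.
assert abs(statistics.mean(scaled) - 0.95 * statistics.mean(data)) < 1e-9
assert abs(statistics.stdev(scaled) - 0.95 * statistics.stdev(data)) < 1e-9
assert abs((max(scaled) - min(scaled)) - 0.95 * (max(data) - min(data))) < 1e-9

q_orig = statistics.quantiles(data, n=4)
q_new = statistics.quantiles(scaled, n=4)
assert all(abs(b - 0.95 * a) < 1e-9 for a, b in zip(q_orig, q_new))
print(q_orig, q_new)
```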
I can't do all the exercises; from here on I just categorize them.
Imagine we need to find that line:
We don't have to bruteforce our way to the best-fitting line.
We already know that a line can be described with the following formula:
We find r (correlation coefficient):
We find the covariance:
Then:
We find
We find q using the mean values, because we are sure that they are on the regression line:
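The steps above can be sketched in Python; the (x, y) data here is hypothetical, just to show the computation. The slope comes out the same whether you use cov(x, y)/var(x) or r·s_y/s_x, and the intercept q is found from the means, since (x̄, ȳ) always lies on the regression line.

```python
import statistics

# Hypothetical data purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Sample covariance, then the correlation coefficient r.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
r = cov / (sx * sy)

# Slope, two equivalent ways; intercept from the means.
m = cov / statistics.variance(x)
m_alt = r * sy / sx
q = my - m * mx

print(round(m, 3), round(q, 3), round(r, 4))
```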
If the exercise asks you to motivate why the regression line fits the data well, you say that
It is the collection of all possible outcomes of an experiment.
A box contains four balls: one red, one blue, one yellow and one pink.
We are in the same sample space as the examples above:
When we are searching for the probability of an event A given that another event B has already happened, we can restrict the sample space to the event B.
Now we search for the intersection of the two events in the sample space of B and we get this formula:
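The conditional-probability formula P(A|B) = P(A and B) / P(B) can be checked on a tiny hypothetical example (one fair die roll), enumerating the sample space directly:

```python
from fractions import Fraction

# Sample space of one fair die roll.
space = {1, 2, 3, 4, 5, 6}
A = {x for x in space if x % 2 == 0}  # event: roll is even
B = {x for x in space if x > 3}       # event: roll is greater than 3

def p(event):
    return Fraction(len(event), len(space))

# Restricting the sample space to B: of {4, 5, 6}, two outcomes are even.
p_a_given_b = p(A & B) / p(B)
print(p_a_given_b)  # 2/3
```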
Two events are dependent if the happening of one of them changes the probability of the other one happening.
In this formula, the more B is independent from A, the more
P(N) = 175 / 590 = 0.296
P(S) = 570 / 590 = 0.966
P(S|N) = 160 / 175 = 0.914
Compute the probability that an individual did not wear a seatbelt and survived:
P(N and S) = P(N) x P(S|N) = 0.296 x (160/175) = 0.27
P(N and S) if independent = P(N) x P(S) = 0.296 x 0.966 = 0.286
Since those two are not equal, the events are not independent.
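The whole seatbelt computation, using the counts from the exercise, can be written out as:

```python
# Counts from the exercise: 590 individuals total, 175 wore no seatbelt,
# 570 survived overall, 160 survived among the non-wearers.
total, no_belt, survived, survived_no_belt = 590, 175, 570, 160

p_n = no_belt / total                      # P(N), about 0.297
p_s = survived / total                     # P(S), about 0.966
p_s_given_n = survived_no_belt / no_belt   # P(S|N), about 0.914

p_n_and_s = p_n * p_s_given_n   # actual joint probability, about 0.271
p_if_independent = p_n * p_s    # what it would be under independence, about 0.287

# The two differ, so N and S are not independent.
print(round(p_n_and_s, 3), round(p_if_independent, 3))
```

(The note's 0.27 and 0.286 come from rounding intermediate values; the conclusion is the same.)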
We are sometimes asked to find the mean of a probability distribution. That is the Expectation:
If we are asked to find the variance:
Literally the expectation of the squared difference of the points from the mean.
If you are asked to complete the distribution, remember that the y values must amount to 1.
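A short sketch with a made-up discrete distribution: the expectation is the probability-weighted average, and the variance is literally the expectation of the squared deviation from that mean.

```python
# Hypothetical discrete distribution: values and their probabilities.
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]
assert abs(sum(probs) - 1.0) < 1e-9  # the probabilities must amount to 1

# Expectation: probability-weighted average of the values.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: expectation of the squared difference from the mean.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean, variance)
```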
Some exercises may ask you to calculate the probability in an interval of the Normal distribution.
I think you would do this with integrals, but integrating the normal distribution is not easy (its density has no elementary antiderivative).
Anyway, we have this exercise:
In a population the vehicle speed distribution is well approximated by a Normal curve with mean 50 and standard deviation 15.
Basically the z-score is how many standard deviations our value is away from the mean.
The z-score is useful because it is standardized for every normal curve.
Basically if we get an exercise like the one above, where we would need to use an integral, we have a table of ready-to-go values, the z-table.
The z-table assigns to every z-score the area under the curve up to that z-score (the area to its left).
The table is computed from the standard normal curve, but the z-score is standardized, so if our distribution follows a normal curve we can use the table.
Getting back to the exercise:
Compute the probability that a randomly selected vehicle speed is greater than 73:
We use the entry -1.53 because if we used 1.53 we would get all the area to the left of 1.53, and we want the area to the right. We can use the negative z because the normal distribution is symmetric:
Our result is 0.063.
We could have also computed the complement of the area (1 minus the table value), instead of negating the z-score.
This is because the total area of the normal curve is 1.
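Instead of a z-table, the same lookup can be done with the standard normal CDF, which Python can express through `math.erf`. This sketch redoes the speed exercise (µ = 50, σ = 15, P(X > 73)) and shows that the symmetry trick and the complement give the same answer:

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, x = 50, 15, 73
z = (x - mu) / sigma               # standard deviations above the mean, about 1.53
p_greater = 1 - normal_cdf(z)      # right tail = complement of the table area
p_via_symmetry = normal_cdf(-z)    # same area via symmetry, about 0.063

print(round(z, 2), round(p_greater, 3))
```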
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times.
Binomial distributions must meet the following three criteria:
- If you toss a coin once, your probability of getting a tails is 50%.
- If you toss a coin 20 times, your probability of getting at least one tails is very, very close to 100%.
Each observation or trial is independent.
In other words, none of your trials have an effect on the probability of the next trial.
The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.
The thing is, imagine we flip a coin 5 times. There are 2^5 = 32 possible sequences of heads and tails.
The formula to find out is:
Mean of a binomial distribution is given by:
With big sample sizes the binomial distribution approximates a normal distribution, so we can use z-scores to find areas under the curve.
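The coin example can be sketched with the standard binomial formula, C(n, k) · p^k · (1 − p)^(n − k): C(n, k) counts the sequences containing exactly k tails, and each such sequence has probability p^k (1 − p)^(n − k). The pmf sums to 1, and the mean works out to n·p.

```python
import math

def binom_pmf(k, n, p):
    # C(n, k) ways to place the k successes among n trials,
    # each sequence having probability p**k * (1-p)**(n-k).
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.5  # five coin flips
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
print(pmf)  # [1/32, 5/32, 10/32, 10/32, 5/32, 1/32] as decimals

mean = sum(k * q for k, q in zip(range(n + 1), pmf))
print(mean)  # 2.5, which equals n * p
```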
When people enter an Apple Store,
Imagine we sample the population and try to obtain p from the samples. Now p becomes uncertain: it is a random variable.
According to the Central Limit Theorem, for large samples, the sample proportion is approximately normally distributed, with mean:
Where:
If we are investigating a mean (not a proportion), the formula for the standard deviation is:
Sometimes we want to compute the probability of successes being more than a certain number.
We know that we can get the area under a curve by using the z-scores, but this distribution only approximates a normal distribution when using a large n.
So when we have a small n we need to go sideways:
For the population of individuals who own an iPhone, suppose p = 0.25 is the proportion that has a given app.
Find the probability that the sample proportion having the app is at least 0.75 when n = 4.
Here the sample size is too small, so we can't use the normal distribution stuff.
0.75 of 4 = 3, so we need the probability that at least 3 people have the app.
We do that by summing the probabilities that 3 people have the app and 4 people have the app.
Since those probabilities are discrete and there are only 2 possible outcomes per trial, we can use the binomial distribution formula
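The exact computation for this exercise is short: with n = 4 and p = 0.25, "at least 3 of 4" means summing the binomial probabilities for k = 3 and k = 4.

```python
import math

def binom_pmf(k, n, p):
    # C(n, k) * p^k * (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.25
# n is too small for the normal approximation, so sum the exact
# binomial probabilities: P(X >= 3) = P(X = 3) + P(X = 4).
p_at_least_3 = binom_pmf(3, n, p) + binom_pmf(4, n, p)
print(p_at_least_3)  # 0.05078125
```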
In the population, IQ scores are normally distributed with mean µ = 100 and standard deviation σ = 15. Suppose we draw a random sample of 25 individuals from the population and measure their IQ scores.
The standard deviation of the sample mean for a sample of size 25 is given by:
The z-scores for 98 and 102 are:
The areas given by the z-tables for the z-scores are:
The area between the two z-scores is given by:
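A sketch of the whole computation, assuming the 15 is the standard deviation (the usual convention for IQ scores): the standard deviation of the sample mean is σ/√n = 3, the z-scores for 98 and 102 are about ±0.67, and the area between them comes from the normal CDF.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 100, 15, 25       # assuming sigma = 15 is the standard deviation
se = sigma / math.sqrt(n)        # standard deviation of the sample mean: 3.0

z_low = (98 - mu) / se           # about -0.67
z_high = (102 - mu) / se         # about +0.67

# P(98 < sample mean < 102): area between the two z-scores.
area = normal_cdf(z_high) - normal_cdf(z_low)
print(round(z_low, 2), round(z_high, 2), round(area, 3))
```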
It's the typical deviation of the sample estimate from the true value, across repeated samples.
For the sample mean it is:
For the sample proportion it is:
Standard error and standard deviation are both measures of variability, but standard deviation is a descriptive statistic that can be calculated from sample data, while standard error is an inferential statistic that can only be estimated.
The confidence level is the overall capture rate if the method is used many times. The sample mean will vary from sample to sample, but the method "estimate ± margin of error" is used to get an interval based on each sample.
C% of these intervals capture the unknown population mean 𝜇.
In other words, the actual mean will be located within the interval C% of the time.
The population mean for a certain variable is estimated by computing a confidence interval for that mean.
The formula for the confidence interval is:
In order to find the confidence interval, we must find the margin of error first:
For each C% there is a specific z-score, that gives you the bounds of the interval.
The margin of error is just that bound converted back from the standardized scale: we multiply the z-score by the standard error to turn it into a real value.
You can find the values for
Ex. When a General Social Survey asked 1326 subjects, "Do you believe in science?", the proportion who answered yes was 0.82.
Construct the 95% confidence interval.
The sample proportion is equal to 0.82, now we just need the standard error in order to calculate the margin of error.
This is the standard error formula for the sample proportion:
It's different from the one we use for the sample mean:
So the confidence interval would be
We are 95% confident that between 79.8% and 84.2% of people believe in science.
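The whole interval can be computed in a few lines; any small differences from the note's bounds come from intermediate rounding of the standard error.

```python
import math

p_hat, n = 0.82, 1326   # sample proportion and sample size from the survey
z = 1.96                # z-score for a 95% confidence level

se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = z * se                          # margin of error
ci = (p_hat - margin, p_hat + margin)
print(round(ci[0], 3), round(ci[1], 3))  # about (0.799, 0.841)
```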
If sample size increases, the margin of error decreases, and thus the CI becomes narrower.
The significance level is a threshold that determines whether a study result can be considered statistically significant after performing the statistical tests.
A
We expect to obtain a sample mean that falls in the critical region 5% of the time.
I think it is the z-value for the bounds of the critical regions.
If
So we need to find the z-value for the area 0.025 and that will be our "multiplier".
It's like the normal curve but wider, with heavier tails.
You must use the t-distribution table(instead of z-table) when the population standard deviation is not known and the sample size is small (n<30).
General Correct Rule: If σ is not known, then using t-distribution is correct. If σ is known, then using the normal distribution is correct.
Do exercise 2 and the last ones. We did no exercises on the t-distribution.
I will explain using an exercise:
In a sample of 402 Tor Vergata first-year students, 174 are enrolled into Statistics course.
Is the proportion of students enrolled into Statistics course in the population of all Tor Vergata first-year students different from 0.50 at the significance level α = 0.05?
The distribution approximates to a normal distribution because of the large sample size.
We are given a hypothesis:
In a significance test, the null hypothesis is presumed to be true unless the data give strong evidence against it.
A test statistic measures how far the point estimate falls from the parameter value given in the null hypothesis. The result is the number of standard errors between the two.
First, we construct the normal curve considering the hypothesis:
We can already see from the plot that this sample proportion really doesn't agree with our hypothesis.
Mathematically, to disprove the hypothesis we need to check if the sample mean/proportion lands beyond the significance level threshold.
In order to do just that, we need the z-score for the sample proportion, also called the test statistic:
Now we compute the areas under the curves to determine whether the sample proportion exceeds the
This is just the area under the curve at the left of the test statistic value. We can later compare it with the significance level to see if we really are out of bounds.
Because the significance level
So we can either use
Once we have the p-value, we proceed to either reject or not reject the hypothesis by comparing the two areas
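The test in this exercise can be sketched end to end: compute the sample proportion, the standard error under the null hypothesis p₀ = 0.50, the z test statistic, and the two-sided p-value, then compare against α = 0.05.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Exercise data: 174 of 402 first-year students are enrolled in Statistics.
n, successes = 402, 174
p0, alpha = 0.50, 0.05

p_hat = successes / n                  # about 0.433
se0 = math.sqrt(p0 * (1 - p0) / n)     # standard error under the null hypothesis
z = (p_hat - p0) / se0                 # test statistic, about -2.69

# Two-sided p-value: the area in both tails beyond |z|.
p_value = 2 * normal_cdf(-abs(z))
print(round(z, 2), round(p_value, 4), p_value < alpha)
```

Since the p-value falls below α, the sample gives strong evidence against the 0.50 hypothesis.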
In this case
A regression equation describes how the mean value of a Y-variable relates to specific values of the X-variable:
SSE stands for "Sum of Squared Errors", it is calculated as follows:
We are basically computing the sum of the squared residuals, once we have our prediction:
If we want to find the two parameters, we have to minimize the SSE.
We do that by computing
They are called the "least squares estimates" of
We use the notation
We use the notation
If we are trying to estimate how good our model is, we can use the "Mean Squared Error":
The MSE is the sample variance of the errors and estimates
Of course, then the standard deviation of errors is computed as:
Which is the sample standard deviation of the errors (residuals) from the regression line.
It can be interpreted as the average deviation of individuals from the sample regression line.
It is the total sum of squares:
It is basically the sum of all the squared deviations from the mean.
It is interpreted as the fraction of variation in y that is explained by the fitted regression equation. It is often converted to a percentage.